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Abstract 

We propose a framework for analyzing and comparing distributions, allowing us to design 
statistical tests to determine if two samples are drawn from different distributions. Our 
test statistic is the largest difference in expectations over functions in the unit ball of a 
reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation 
bounds for the test statistic, while a third is based on the asymptotic distribution of this 
statistic. The test statistic can be computed in quadratic time, although efficient linear 
time approximations are available. Several classical metrics on distributions are recovered 
when the function space used to compute the difference in expectations is allowed to be 
more general (eg. a Banach space). We apply our two-sample tests to a variety of problems, 
including attribute matching for databases using the Hungarian marriage method, where 
they perform strongly. Excellent performance is also obtained when comparing distribu- 
tions over graphs, for which these are the first such tests. 



Keywords: Kernel methods, two sample test, uniform convergence bounds, schema 
matching, asymptotic analysis, hypothesis testing. 
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1. Introduction 

We address the problem of comparing samples from two probability distributions, by propos- 
ing statistical tests of the hypothesis that these distributions are different (this is called the 
two-sample or homogeneity problem). Such tests have application in a variety of areas. In 
bioinformatics, it is of interest to compare microarray data from identical tissue types as 
measured by different laboratories, to detect whether the data may be analysed jointly, or 
whether differences in experimental procedure have caused systematic differences in the data 
distributions. Equally of interest are comparisons between microarray data from different 
tissue types, either to determine whether two subtypes of cancer may be treated as sta- 
tistically indistinguishable from a diagnosis perspective, or to detect differences in healthy 
and cancerous tissue. In database attribute matching, it is desirable to merge databases 
containing multiple fields, where it is not known in advance which fields correspond: the 
fields are matched by maximising the similarity in the distributions of their entries. 

We test whether distributions p and q are different on the basis of samples drawn from 
each of them, by finding a well behaved (e.g. smooth) function which is large on the points 
drawn from p, and small (as negative as possible) on the points from q. We use as our test 
statistic the difference between the mean function values on the two samples; when this is 
large, the samples are likely from different distributions. We call this statistic the Maximum 
Mean Discrepancy (MMD). 

Clearly the quality of the MMD as a statistic depends on the class 3" of smooth functions 
that define it. On one hand, 9" must be "rich enough" so that the population MMD 
vanishes if and only if p = g. On the other hand, for the test to be consistent, 3" needs 
to be "restrictive" enough for the empirical estimate of MMD to converge quickly to its 
expectation as the sar nple size in c rease s. We shall use the unit balls in universal reproducing 
kernel Hilbert spaces (ISteinwartl . boOlh as our function classes, since these will be shown to 



satisfy both of the foregoing properties (we also review classical metrics on distributions, 
namely the Kolmogorov-Smirnov and Earth-Mover's distances, which are based on different 
function classes). On a more practical note, the MMD has a reasonable computational cost, 
when compared with other two-sample tests: given m points sampled from p and n from 
q, the cost is 0{m + n)^ time. We also propose a less statistically efficient algorithm 
with a computational cost of 0{m + n), which can yield superior performance at a given 
computational cost by looking at a larger volume of data. 

We define three non-parametric statistical tests based on the MMD. The first two, which 
use distribution-independent uniform convergence bounds, provide finite sample guarantees 
of test performance, at the expense of being conservative in detecting differences between 
p and q. The third test is based on the asymptotic distribution of the MMD, and is 
in practice more sensitive to differences i n distribution at smal l s ample sizes. The pr esent 



work synth e sizes and expands on results of lGretton et al.l (l20C)7aljbri.lSmola et al.l (120071 1 , and 
Song et al.l (|2008H ^ who in turn build on the earlier work of iBorgwardt et alj (l2006l). Note 



that th e latter addresses only the third kind of test, and that the approach of lGretton et al 



(l2007al lbh employs a more accurate approximation to the asymptotic distribution of the test 



statistic. 



In particular, most of the proofs here were not provided by [Gretton et al.l l|2007al ) 
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We begin our presentation in Section [2] with a formal definition of the MMD, and a 
proof that the population MMD is zero if and only ii p = q when 9" is the unit ball of a 
universal RKHS. We also review alternative function classes for which the MMD defines a 
metric on probability distributions. In Section [3l we give an overview of hypothesis testing 
as it applies to the two-sample problem, and review other approaches to this problem. We 
present our first two hypothesis tests in Section |H based on two different bounds on the 
deviation between the population and empirical MMD. We take a different approach in 
Section O where we use the asymptotic distribution of the empirical MMD estimate as the 
basis for a third test. When large volumes of data are available, the cost of computing the 
MMD (quadratic in the sample size) may be excessive: we therefore propose in Section [6] 
a modified version of the MMD statistic that has a linear cost in the number of samples, 
and an associated asymptotic test. In Section [71 we provide an overview of methods re- 
lated to the MMD in the statistics and machine learning literature. Finally, in Section [HI 
we demonstrate the performance of MMD-based two-sample tests on problems from neu- 
roscience, bioinformatics, and attribute matching using the Hungarian marriage method. 
Our approach performs well on high dimensional data with low sample size; in addition, we 
are able to successfully distinguish distributions on graph data, for which ours is the first 
proposed test. 



2. The Maximum Mean Discrepancy 

In this section, we present the maximum mean discrepancy (MMD), and describe conditions 
under which it is a metric on the space of probability distributions. The MMD is defined in 
terms of particular function spaces that witness the difference in distributions: we therefore 
begin in Section [2. II by introducing the MMD for some arbitrary function space. In Section 
12.21 we compute both the population MMD and two empirical estimates when the associated 
function space is a reproducing kernel Hilbert space, and we derive the RKHS function that 
witnesses the MMD for a given pair of distributions in Section 12. 3[ Finally, we describe the 
MMD for more general function classes in Section 12. 4[ 



2.1 Definition of the Maximum Mean Discrepancy 

Our goal is to formulate a statistical test that answers the following question: 

Problem 1 Let p and q be Borel probability measures defined on a domain X. Given 
observations X := {xi, . . . , Xm} and Y := {yi, . . . , yn}, drawn independently and identically 
distributed (i.i.d.) from p and q, respectively, can we decide whether p ^ q? 

To start with, we wish to determine a criterion that, in the population setting, takes on a 
unique and di stinctive value only when p = q. It will be defined based on Lemma 9.3.2 of 
DudlevI jiooi). 



Lemma 1 Let (X, d) be a metric space, and letp, q be two Borel probability measures defined 
on X. Then p = q if and only if 'Eixr^p{f {x)) = 'Eiy^q{f {y)) for all f G C(X), where C(X) is 
the space of bounded continuous functions on X. 

Although C (X) in principle allows us to identify p = q uniquely, it is not practical to work 
with such a rich function class in the finite sample setting. We thus define a more general 
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class of s tatistic, for as yet unspeci f ied function c lasses 3", to measure the disparity between 
p and q (jPortet and Mourieil . Il953l : iMiilleil . Il997l l. 



Definition 2 Let 3" 6e a class of functions / : X ^ M and let p, q, X, Y be defined as above. 
We define the maximum mean discrepancy (MMD) as 



MMD [:?,p, q] := sup (E,^p[/(x)] - E,^,[/(y)]) . 



Miille-^ h99i ) calls this an integral probability metric. A biased empirical estimate of the 
MMD is 



/ m n 
MMDb [3-, X, Y\ := sup - V /(x,) - - V f{yi 



(2) 



The empirical MMD defined above has an upward bias (we will define an unbiased statistic 
in the following section). We must now identify a function class that is rich enough to 
uniquely identify whether p = q, yet restrictive enough to provide useful finite sample 
estimates (the latter property will be established in subsequent sections). 

2.2 The MMD in Reproducing Kernel Hilbert Spaces 

If !J is the unit ball in a reproducing kernel Hilbert space the empirical MMD can be 
computed very efficiently. This will be the main approach we pursue in the present study. 
Other possible function classes 3" are discussed at the end of this section. We will refer to 
as universal whenever !K, defined on a compact metric space X and with associated kerne l 
A; : X^ ^ R, is dense in C(X) with respect to the Loo norm. It is shown in Isteinwarti ( 2001 ) 
that Gaussian and Laplace kernels are universal. We have the following result: 

Theorem 3 Let 3' be a unit ball in a universal RKHS 'K, defined on the compact metric 
space X, with associated kernel k{-, •). Then MMD [3", p, q] = if and only if p = q. 

Proof It is clear that MMD [3",p, is zero ii p = q. We prove the converse by showing 
that MMD [C{X),p, q] = D for some D > implies MMD [3',p, q] > 0: this is equivalent to 
MMD [3", p, g] = implying MMD [C{X),p,q] = (where this last result implies p = q hy 
Lemmadl noting that compactness of the metric space X implies its separability). Let !K be 
the universal RKHS of which J" is the unit ball. If MMD [C(X),p, q] = D, then there exists 



some / G C(X) for which Ep / 



E„ 



/ > D/2. We know that J{ is dense in C(X) with 



respect to the L^o norm: this means that for e = D/8, we can find some /* G !K satisfying 



/*-/ 



< e. Thus, we obtain 



Ep [/* 



\Ep[f*]-B,[f*]\ > Ep / 



E, 



E, 



/ 



/ 



< e and consequently 



2e > 



D 



D 



> 0. 



Finally, using 



< oo, we have 

[Ep [/*] - E, [/*]] /iir 11^ > D/ii iir 11^) > 0, 

and hence MMD ['J,p,q] > 0. 
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We now review some pr operties of J{ that wi l l allo w us to express the MMD in a more 
easily computable form ( Scholkopf and Smola . 20021 ). Since !K is an RKHS, the operator 
of evaluation 6x mapping / E Jf to f{x) G M is continuous. Thus, by the Riesz represen- 
tation theorem, there is a feature mapping (p{x) from X to M such that f{x) = (/, 0(2;))r^. 
Moreover, {(p{x) , (p{y)) = k(x,y), where k ( x,y) is a positive definite kernel function. The 
following lemma is due to lBorgwardt et al.l (j2006l ^. 

Lemma 4 Denote the expectation of cf>{x) by fip := Ep[cf>{x)] (assuming its existence)^ 
Then 



Proof 



MMB[3^,p,q]= sup {fi\p] - fi[q], f) = y\p] - fi[ 



11/11 



MMD2[J,p,g] 



sup (Ep[/(x)]-E,[/(2/)]) 



sup (Ep[(</.(x),/)^]-E,[(0(y),/)^]) 
/ll^<i 



sup (/ip - m<?,/):k 
/ll^<i 



(3) 



Given we are in an RKHS, the norm — ^gll^^^; may easily be computed in terms of kernel 
functions. This leads to a first empirical estimate of the MMD, which is unbiased. 

Lemma 5 Given x and x' independent random variables with distribution p, and y and y' 
independent random variables with distribution q, the population MMD^ is 

MMD2 [J, p, q] = E^,^.^p [k{x, x')] - 2E^.^p,y^g [k{x, y)] + Ey^y,^^ [k{y, y')] . (4) 

Let Z := ( 

zi,...,Zm) be m i.i.d. random variables, where Zi : — {xi,yi) (i.e. we assume 
m = n). An unbiased empirical estimate o/MMD^ is 

m 

m — 1) ^—^ 



MMBl [3", X, Y] 



{m){m — 1) 



(5) 



which is a one-sample U-statistic with h{zi, Zj) := k{xi,Xj) + k{yi,yj) — k{xi,yj) — k{xj,yi) 
(we define h{zi,Zj) to be symmetric in its arguments due to requirements that will arise in 
Section\5\). 

Proof Starting from the expression for MMB^IS', p,q] in Lemma H 
MMB^[3;p,q] = ||/xp-/ig||^ 

= Ep {(t>{x),c^{x'))^ + E, {ct>{y), cPiy'))^ - 2Ep,g {cPix), cPiy))^^, 



2. A sufficient condition for this is HaipHjc < oo, wtiicti is rearranged as Ep[k{x, x')] < oo, wtiere x and x' 
are independent random variabies drawn according to p. In other words, is a trace ciass operator with 
respect to the measure p. 
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Figure 1: Illustration of the function maximizing the mean discrepancy in the case where 
a Gaussian is being compared with a Laplace distribution. Both distributions 
have zero mean and unit variance. The function / that witnesses the MMD has 
been scaled for plotting purposes, and was computed empirically on the basis of 
2 X 10^ samples, using a Gaussian kernel with a = 0.5. 



The proof is completed by applying {(j){x), (l){x'))ry^ = k{x, x'); the empirical estimate follows 
straightforwardly. ■ 

The empirical statistic is an unbiased estimate of MMD^, although it does not have mini- 
mum variance, since we are ignoring the cross-terms kjxj , Uj) of which there are only 0(n). 
The minimum variance estimate is almost identical, though ( Serfiing . IQSd . Section 5.1.4). 

The biased statistic in ([2]) may also be easily computed following the above reasoning. 
Substituting the empirical estimates fJ.[X] := ^ X^I^i 4>{xi) ^-iid fj,[Y] := ^ Y^^=i 4>{yi) of the 
feature space means based on respective samples X and Y, we obtain 



MMDf, [J, X, Y] 



^ m 



Xi, Xj) 



«J=1 



2 

mn 



m,n ^ n 



(6) 



Intuitively we expect the empirical test statistic MMD[9", X, y], whether biased or un- 
biased, to be small iip = q, and large if the distributions are far apart. It costs 0((m + n)^) 
time to compute both sta tistics. 

Finally, we note that Harchaoui et al. ( 20081 ) recently proposed a modification of the 
kernel MMD statistic in Lemma HI by scaling the feature space mean distance using the 
inverse within-sample covariance operator, thus employing the kernel Fisher discriminant as 
a statistic for testing homogeneity. This statistic is shown to be related to the divergence. 



2.3 Witness Function of the MMD for RKHSs 

It is also instructive to consider the witness / which is chosen by MMD to exhibit the 
maximum discrepancy between the two distributions. The population / and its empirical 
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estimate /(x) are respectively 

f{x) oc {(l){x),fi[p\ - fi[q\) = E^f^p[k{x,x')]-E^fr^q[k{x,x')] 

fix) oc {Hx),f^[X]-f,[Y]) = iY:^Uk{x.,x)-lYri=lKy^,^)■ 

This follows from the fact that the unit vector v maximizing {v,x)r^ in a Hilbert space is 
V = x/ \\x\\. 

We illustrate the behavior of MMD in Figure [1] using a one-dimensional example. The 
data X and Y were generated from distributions p and q with equal means and variances, 
with p Gaussian and q Laplacian. We chose to be the unit ball in an RKHS using the 
Gaussian kernel. We observe that the function / that witnesses the MMD — in other 
words, the function maximizing the mean discrepancy in ([T|) — is smooth, positive where 
the Laplace density exceeds the Gaussian density (at the center and tails), and negative 
where the Gaussian density is larger. Moreover, the magnitude of / is a direct reflection of 
the amount by which one density exceeds the other, insofar as the smoothness constraint 
permits it. 

2.4 The MMD in Other Function Classes 

The definition of the maximum mean discrepancy is by no means limited to RKHS. In fact, 
any function class J' that comes with uniform convergence guarantees and is sufficiently 
powerful will enjoy the above properties. 

Definition 6 Let y he a subset of some vector space. The star S[y] of a set 9" is 

S[3^\ := {ax\x G 9" and a £ [0, oo)} 

Theorem 7 Denote by 3^ the subset of some vector space of functions from X to M for which 
SIS'] n C(X) is dense in C(X) with respect to the Loo(X) norm. Then MMD [J'jp, g] = if 
and only if p = q. 

Moreover, under the above conditions MMD[9",p, q] is a metric on the space of probability 
distributions. Whenever the star of 3^ is not dense, MMD is a pseudo-metric spaced 

Proof The first part of the proof is almost identical to that of Theorem [3] and is therefore 
omitted. To see the second part, we only need to prove the triangle inequality. We have 

sup \Epf - EJ\ + sup \Egg - Erg\ > sup [\Epf - Egf\ + \EJ - Er\] 
fe9 ses- f£3^ 

> sup \Epf - Erf\ . 

The first part of the theorem establishes that MMD[ir,p, q] is a metric, since only for p = q 
do we have MMD[3",p, q] = 0. ■ 

Note that any uniform convergence statements in terms of 3" allow us immediately to char- 
acterize an estimator of MMD(3", p, g) explicitly. The following result shows how (we will 
refine this reasoning for the RKHS case in Section Hj). 

3. According to iDudlevI (|2002l . p. 26) a metric d{x,y) satisfies the following four properties: symmetry, 
triangle inequality, d(x,x) — 0, and d[x,y) — =^ x — y. A pseudo-metric only satisfies the first three 
properties. 
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Theorem 8 Let 5 £ (0, 1) be a confidence level and assume that for some e{5, m, 3") the 
following holds for samples {xi, . . . , Xm} drawn from p: 



{ 1 

Pr sup E,[/]--^/(. 



> e{5, m, J) ^ < 5. 



In this case we have that 

Pr{|MMD[J,p,g] -MMDfe[J,X,y]| > 2e((5/2, m, J)} < 5. 
Proof The proof works simply by using convexity and suprema as fohows: 
I MMD [J, p, q] - MMDf, [J, X, Y] \ 



(7) 



(8) 



sup|Ep[/] -Eg[/]| - sup 



< sup 



< sup 



m ^-^ n ^-^ 

1=1 I- 

E,[/]-E,[/]-lx;/(x.) + ix;/(?^^ 

m ^ — ^ r) ^ — ^ 



i=l 



i=l 



i=l 



i=l 



^ m 1 



i=l 



Bounding each of the two terms via a uniform convergence bound proves the claim. ■ 

This shows that MMDf,[3", X, Y] can be used to estimate MMD[3",p, q] and that the quantity 
is asymptotically unbiased. 

Remark 9 (Reduction to Binary Classification) Any classifier which maps a set of 
observations {zi, li} with Zi £% on some domain X and labels li £ {±1}, for which uniform 
convergence bounds exist on the convergence of the empirical loss to the expected loss, can 
be used to obtain a similarity measure on distributions — simply assign li = 1 if Zi £ X and 
li = —1 for Zi £ Y and find a classifier which is able to separate the two sets. In this case 
maximization o/ Ep[/] — Eg[/] is achieved by ensuring that as many z ~ p{z) as possible 
correspond to f{z) = 1, whereas for as many z ~ q{z) as possible we have f{z) = —1. 
Consequently neural networks, decision trees, boosted classifiers and other objects for which 
uniform convergence bou nds can be obtain e d can be used for the purpose of distribution 
comparison. For instance. Wen-David et al\ 1200% Section 4) us e the error of a hyp erylane 
classifier to approximate the A- distance between distributions of Kifer et al. (200^) . 



2.5 Examples of Non-RKHS Function Classes 

Other function spaces 3" inspired by the statistics literature can also be considered in defining 
the MMD. Indeed, Lemma [T] defines an MMD with 3^ the space of bounde d continuous r eal- 
valued functions, which is a Banach space with the supremum norm ( Dudley . 20021 . p. 
158). We now describe two further metrics on the space of probability distributions, the 
Kolmogorov-Smirnov and Earth Mover's distances, and their associated function classes. 
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2.5.1 Kolmogorov-Smirnov Statistic 



The Kolmogorov-Smirnov (K-S) test is probably one of the most famous two-sample tests in 
statistics. It works for random variables x G M (or any other set for which we can establish 
a total order). Denote by Fp{x) the cumulative distribution function of p and let Fx{x) be 
its empirical counterpart, that is 

^ m 

Fp{z) := Fr{x< z for x ~ p{x)} and Fx{z) ■= jj^]'^ lz<x,- 



i=l 



It is clear that Fp captures the properties of p. The Kolmog orov metr i c is si mply the Loo 
distance \\Fx — -^y|loo ^^^^ observations X and Y. ISmirnov (193^) showed that 

for p = q the limiting distribution of the empirical cumulative distribution functions satisfies 



lim Pr 

m,n—> 00 



m+n 



Fx - Fy\\^ >x \ = 2^(-l)-'-ie-2i'^' for x > 0. 



(9) 



This allows for an efficient characterization of the distribution under the null hypothesis 
" Kf). Efficient nume rical approximations to ([9]) can be found in numerical analysis handbooks 
(jPress et al.1 . Il994l ). The distribution under the alternative, p q, however, is unknown. 

The Kolm qgorov r netric is, in fact, a special instance of MMD[3", p, g] for a certain 
Banach space (iMiilleii . 119971 . Theorem 5.2) 



Proposition 10 Let 3^ be the class of functions X ^ M 0/ bounded variatioi^ 1. Th 
MMD[J,p,q] = \\1 



F 



2.5.2 Earth-Mover Distances 

Another class of distance measures on distributions that may be written as an MMD are 
the Earth- Mover distances. We assume (X, d) is a separable metric space, and define yi(X) 
to be the space of probability measures on X for which J d{x, z)dp{z) < 00 for all p £ J'i(X) 
and X G jC (these are the pro bability rneasu res for which E |x| < 00 when X = M). We then 
have the following definition (Dudley, 20021 . p. 420). 



Definition 11 (Monge-Wasserstein metric) Letp G J'i(X) andq G !Pi(X). The Monge- 
Wasserstein distance is defined as 



W{p,q):= inf d{x,y)dfi{x,y), 
where M{p, q) is the set of joint distributions on X x X with marginals p and q. 



4. A function / defined on [a, b] is of bounded variation C if the total variation is bounded by C, i.e. the 
supremum over all sums 

J2 1/(^0-/(^.-1)1, 



where a < a;o < . . . < a;„ < 6 jPudlevl [2003 . p. 184). 
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We may interpret this as the cost (as represented by the metric d{x, y)) of transferring mass 
distributed according to p to a distribution in accordance with q, where is the movement 
schedule. In general, a large variety of costs of moving mass from x lo y ca n be used, such 
as psychooptic al similarity me asures in image retrieval ( Rubner et al. . 200d ) . The following 
theorem holds (jPudlevl . [jnoi . Theorem 11.8.2). 

Theorem 12 (Kantorovich- Rubinstein) Let p G J'i(X) and q G where X is 

separable. Then a metric on J'i(5') is defined as 



W{p,q) = \\p- 



where 



sup 



sup / fd{p-q) 

ll/IL<i 

\m-f{y)\ 

d{x,y) 



is the Lipschitz seminorn^ for real valued f on X. 

A simple example of this theorem is as follows (iDudlevl . l2002l . Exercise 1, p. 425). 

Example 1 Let X = M with associated d{x,y) = \x — y\. Then given f such that 
we use integration by parts to obtain 



< 1, 



/ d{p - q) 



iFp-Fg){x)f'{x)dx 



< 



\{Fp-Fg)\{x)dx, 



where the maximum is attained for the function g with derivative g' = 21fp^f^ — 1 (and for 
which \\g\\]^ = Ij- We recover the Li distance between distribution functions, 



W{P,Q) = J \{Fp-Fg)\ix)dx. 

One may further gener alize Theorem [12] to the set of all laws J'(X) on arbitrary metric 
spaces X (|Dudlevl . l20oi . Proposition 11.3.2). 

Definition 13 (Bounded Lipscliitz metric) Let p and q be laws on a metric space X. 
Then 



11/11 



sup 



<i 



/ d{p - q) 



is a metric on y(X), where f belongs to the space of bounded Lipschitz functions with norm 

\BL ■■= II/IIl + 



3. Background Material 

We now present three background results. First, we introduce the terminology used in 
statistical hypothesis testing. Second, we demonstrate via an example that even for tests 
which have asymptotically no error, one cannot guarantee performance at any fixed sample 
size without making assumptions about the distributions. Finally, we briefly review some 
earlier approaches to the two-sample problem. 

5. A seminorm satisfies the requirements of a norm besides ||a;|| — only for a; = (|Dudlevll2003 . p. 156). 
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3.1 Statistical Hypothesis Testing 

Having described a metric on probability distributions (the MMD) based on distances be- 
tween their Hilbert space embeddings, and empirical estimates (biased and unbiased) of 
this metric, we now address the problem of determining whether the empirical MMD shows 
a statistically significant difference between distributions. To this end, we briefly describe 
the framework of statistic al hypothesis testing as it applies in the present context, following 
Casella and Berger Chapter 8). Given i.i.d. samples X ~ p of size m and y ~ g of 



size n, the statistical test, 7{X, Y) : X™ x X" i-^ {0, 1} is used to distinguish between the 
null hypothesis 'Kq '■ p = q and the alternative hypothesis JCi '■ p ^ Q- This is achieved by 
comparing the test statisticjfl MMD[9", X, y] with a particular threshold: if the threshold is 
exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population 
MMD indicates p = q)- The acceptance region of the test is thus defined as the set of real 
numbers below the threshold. Since the test is based on finite samples, it is possible that an 
incorrect answer will be returned: we define the Type I error as the probability of rejecting 
p = q based on the observed sample, despite the null hypothesis having generated the data. 
Conversely, the Type II error is the probability of accepting p = q despite the underlying 
distributions being different. The level a of a test is an upper bound on the Type I error: 
this is a design parameter of the test, and is used to set the threshold to which we compare 
the test statistic (finding the test threshold for a given a is the topic of Sections S] and [5]) . 
A consistent test achieves a level a, and a Type II error of zero, in the large sample limit. 
We will see that the tests proposed in this paper are consistent. 

3.2 A Negative Result 

Even if a test is consistent, it is not possible to distinguish distributions with high probability 
at a given, fixed sample size (i.e., to provide guarantees on the Type II error), without prior 
assumptions as to the nature of the difference between p and q. This is true regardless 
of the two-sample test used. There are several ways to illustrate this, which each give 
different insight into the kinds of differences that might be undetectable for a given number 
of samples. The following exampl^ is one such illustration. 



Example 2 Assume that we have a distribution p from which we draw m iid observa- 
tions. Moreover, we construct a distribution q by drawing iid observations from p and 
subsequently defining a discrete distribution over these vn? instances with probability 
each. It is easy to check that if we now draw m observations from q, there is at least a 
C^)^^^ > 1 — e""*^ > 0.63 probability that we thereby will have effectively obtained an m 
sample from p. Hence no test will be able to distinguish samples from p and q in this case. 
We could make the probability of detection arbitrarily small by increasing the size of the 
sample from which we construct q. 



6. This may be biased or unbiased. 

7. This is a variation of a construction for independence tests, which was suggested in a private communi- 
cation by John Langford. 
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3.3 Previous Work 



We next give a brief overview of some earlier approaches to the two sample problem for 
multivariate data. Since our later experimental comparison is with respect to certain of these 
methods, we give abbreviated algorithm names in italics where appropriate: these should 
be used as a key to the tables in Section \8\ A generalisatior i of the Wald-Wolfowi t z run s 
test to the multivar i ate do main was proposed and analysed by Friedman and Rafskv ( 19791 ): 
Henze and Penrose ( 19991 ) (FR Wolf), and involves counting the number of edges in the 
minimum spanning tree over the aggregated data that connect points in X to points in Y. 
The resulting test relies on the asymptotic normality of the test statistic, and this quantity 
is not distribution-free under the null hypothesis for finite samples (it depends on p and q). 
The computational cost of this method using Kruskal's algorithm is 0((m + n)'^ logfm + n)) , 
although m ore modern methods i n iprov e on the log(m + n) term. See IChazelld (2000) 
for details. Friedman and Rafsky ( 19791 ) claim that calculating the matrix of distances, 
which costs 0((m + n)^), dominates their computing time; we return to this point in our 
experiments (Section [H). Two possible generali sations of the Kolmogorov-Smirno v test 
to the multivariate case were studied in fBickel. il969l : IPriedman and Rafskvl . Il979l 'l. The 
approach of Friedman and Rafsky (FR Smirnov) in this case again requires a minimal 
spanning tree, and has a similar cost to their multivariate runs test. 



A more recent multivariate test was introduced by iRosenbaumI (120051 ) . This entails 
computing the minimum distance non-bipartite matching over the aggregate data, and using 
the number of pairs containing a sample from both X and y as a test statistic. The resulting 
statistic is distribution-free under the null hypothesis at finite sample sizes, in which respect 
it is superior to the Friedman-Rafsky test; on the other hand, it costs 0((m + n)^) t o 
compute. Another distribution-free test (Hall) was proposed by Hall and Tajvidi ( 20021 ): 
for each point from p, it requires computing the closest points in the aggregated data, and 
counting how many of these are from q (the procedure is repeated for each point from q 
with respect to points from p ). As we shall see in ou r experimental comparisons, the test 
statistic is costly to compute; Hall and Tajvidil ( 20021 ) consider only tens of points in their 
experiments. 



Yet another approach is to use some dis tance (e.g. Li or Ly) between Parzen window 



estimates of the densities as a test statistic (jAnderson et al.l . 11994 : iBiau and Gvorfil . |2005| ) . 
based on the asymptotic distribution of this distance given p = q. When the L2 norm is 
used, the test statistic is related to those w 'e present here, a lthou gh it is arrived at from 



a different perspective. Briefly, the test of Anderson et al. ( 19941 ) is obtained in a more 



restricted setting where the RKHS kernel is an inner product between Parzen windows. 
Since we are not doing density estimation, however, we need not decrease the kernel width 
as the sample grows. In fact, decreasing the kernel width reduces the convergence rate 
of the associated two-sample test, compared with the (m + n)~^^'^ rate for fixed kernels. 
We provide more detail in Section EH The Li approach of lBiau and Gyorfil (120051 ) (Biau) 
requires the space to be partitioned into a grid of bins, which becomes difficult or impossible 
for high dimensional problems. Hence we use this test only for low-dimensional problems 
in our experiments. 
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4. Tests Based on Uniform Convergence Bounds 

In this section, we introduce two statistical tests of independence which have exact perfor- 
mance guarantees at fin ite sample si z es, ba sed on uniform convergence bounds. The first, 



in Section l4.ll uses the iMcD iarmid* (1989) bound on the biased MMD statistic, and the 



second, in Section [321 uses a lHoeffding bound for the unbiased statistic 



4.1 Bound on the Biased Statistic and Test 

We establish two properties of the MMD, from which we derive a hypothesis test. First, we 
show that regardless of whether or not p = q, the empirical MMD converges in probability 
at rate 0((m + n)~2) to its population value. This shows the consistency of statistical 
tests based on the MMD. Second, we give probabilistic bounds for large deviations of the 
empirical MMD in the case p = q- These bounds lead directly to a threshold for our 
first hypothesis test. We begin our discussion of the convergence of MMD;, [ J", X, y ] to 
MMDiS", p,q]. 

Theorem 14 Let p,q,X,Y be defined as in ProblemUl and assume < k{x,y) < K. Then 

Pr {|MMDfe[3-,X,y] - MMD[J,p,g]| > 2 ({K/m)'^ + {K/n)^) + e} < 2exp {j^^^) • 

See Appendix IA.2I for proof. Our next goal is to refine this result in a way that allows us 
to define a test threshold under the null hypothesis p = q. Under this circumstance, the 
constants in the exponent are slightly improved. 

Theorem 15 Under the conditions of TheoremH^ where additionally p = q and m = n, 



MMDfe[:J, X, Y] < m-\ -^2Ep \k{x, x) - k{x, x')] + e < {2K/m)^/^ + e, 
both with probability at least 1 — exp ( — ) (see Appendix \A . 31 for the proof). 



In this theorem, we illustrate two possible bounds Bi{'3',p) and B2{3',p) on the bias in the 
empirical estimate ([6]). The first inequality is interesting inasmuch as it provides a link 
between the bias bound Bi(3',p) and kernel size (for instance, if we were to use a Gaussian 
kernel with large a, then k{x,x) and k{x,x') would likely be close, and the bias small). 
In the context of testing, however, we would need to provide an additional bound to show 
convergence of an empirical estimate of Bi{3',p) to its population equivalent. Thus, in the 
following test for p = q based on Theorem [T5l we use B2{3',p) to bound the biasH 

Corollary 16 A hypothesis test of level a for the null hypothesis p = q, that is, for 
MMD[g",p,g] = 0, has the acceptance region MMDb[J,X,y] < ^/2K/m [l + ^/TkigcF^ 

We emphasise that Theorem 1141 guarantees the consistency of the test, and that the Type 
II error probability decreases to zero at rate 0(m~2), assuming m = n. To put this con- 
vergence rate in perspective, consider a test of whether two normal distributions have equal 



8. Note that we use a tighter bias bound than lGretton et al.l l|2007al ). 
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means, given they have unknown but equal variance (jCasella and Bereerl . bool Exercise 



8.41). In this case, the test statistic has a Student-i distribution with n + m — 2 degrees of 
freedom, and its error probabihty converges at the same rate as our test. 

It is worth noting that bounds may be obtained for the deviation between expectations 
fj,\p] and the empirical means /u[X] in a completely analogous fashion. The proof requires 
symmetrization by means of a ghost sample, i.e. a second set of observations drawn from 
the same distribution. While not the key focus of the present paper, s uch bounds can be 



used in the design of inference principles ba, s ed on moment matching (jAltun and Smolal . 
2nn()l : budik and Schapirel . bnOfil : IPudfk et al.l . 120041 ^ . 



4.2 Bound on the Unbiased Statistic and Test 

While the previous bounds are of interest since the proof strategy can be used for general 
function classes with well behaved Rademacher averages, a much easier approach may be 
used directly on the unbiased statistic MMD^ in LemmaO We base our test on the following 
th eorem, wh i ch is a straightforward application of the large deviation bound on U-statistics 



of lHoeffdind (jl96l p. 25). 



Theorem 17 Assume < k{xi,Xj) < K, from which it follows —2K < h{zi,Zj) < 2K . 
Then ^ 

Pr{MMD2(J,X,y)-MMD2(3-,p,g) >t} <exp 

where m2 := [m/2j (the same hound applies for deviations of —t and below). 
A consistent statistical test for p = q using MMD^ is then obtained. 

Corollary 18 A hypothesis test of level a for the null hypothesis p = q has the acceptance 
region MMD^ < {4K/^/m) Vlog(a-i). 

We now compare the thresholds of the two tests. We note first that the threshold for the 
biased statistic applies to an estimate of MMD, whereas that for the unbiased statistic 
is for an estimate of MMD^. Squaring the former threshold to make the two quantities 
comparable, the squared threshold in Corollary 1161 decreases as m~^, whereas the threshold 
in Corollarv 1181 decreases as m~^/^. Thus for sufficiently larg^ m, the McDiarmid-based 
threshold will be lower (and the associated test statistic is in any case biased upwards), and 
its Type II error will be better for a given Type I bound. This is confirmed in our Section 
[8] experiments. Note, however, that the rate of convergence of the squared, biased MMD 
estimate to its population value remains at l/^/m (bearing in mind we take the square of 
a biased estimate, where the bias term decays as l/^/rn). 

Finally, we note that the bounds we obtained here are rather conservative for a number 
of reasons: first, they do not take the actual distributions into account. In fact, they 
are finite sample size, distribution free bounds that hold even in the worst case scenario. 
The bounds could be tightened using localization, moments of the distribution, etc. Any 
such improvements co uld be plugged straight into Theorem [8] for a tighter bound. See e.g. 
Bousquet et~aD ^200^ ) for a detailed discussion of recent uniform convergence bounding 



9. In the case of a = 0.05, this is m > 12. 
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methods. Second, in computing bounds rather than trying to characterize the distribution 
of MMD(9", X, Y) exphcitly, we force our test to be conservative by design. In the following 
we aim for an exact characterization of the asymptotic distribution of MMD([J', X, Y) instead 
of a bound. While this will not satisfy the uniform convergence requirements, it leads to 
superior tests in practice. 



5. Test Based on the Asymptotic Distribution of the Unbiased Statistic 



We now propose a third test, which is based on the asymptotic distribution of the unbiased 
estimate of MMD^ in Lemma [5l 

Th eorem 19 We assume E (h^) < oo. Under "Ki, MMD^ converges in distribution (see 
e.g. \ Grimmet and Stirzaken . \200A . Section 7.2) to a Gaussian according to 



m2 



i (MMD2 - MMD^ [3-,p, q]) ^ J< (O, al) , 
where al = 4 ( E^ [(E^//i(z, z'))^] — [E^^^/(/i(2;, z'))] ^ j , uniformly at rate Xj ^pm \Serfiin\ . 



198ui . Theorem B, p. 193). Under "Kq, the U-statistic is degenerate, meaning 'Eiz'h{z, z') 
0. In this case, MMD^ converges in distribution according to 



mMMD: 



2 D 



(10) 



1=1 



where zi ~ 3sf(0, 2) i.i.d., Aj are the solutions to the eigenvalue equation 



X 



k{x,x')'ipi{x)dp{x) = \ii)i{x'), 



and k{xi,Xj) := k{xi,Xj) — Exk{xi,x) — Exk{x,Xj) + Eixyk{x,x') is the centred RKHS 
kernel. 



The asymptotic distribution of the test statistic und er !Ki is given by ISerfiin 

3 (|l98d . 

Section 5.5.1), a nd th e distribution under 'Kq follows ISerfiingi (jl98d . Section 5.5.2) and 
Anderson et al.l (|1994| . Appendix); see Appendix IB. II for details. We illustrate the MMD 
density under both the null and alternative hypotheses by approximating it empirically for 
both p = q and p q. Results are plotted in Figure [2j 

Our goal is to determine whether the empirical test statistic MMD^ is so large as to 
be outside the 1 — a quantile of the null distribution in (jlOp (consistency of the resulting 
test is guaranteed by the form of the distribution under !Ki). One way to estimate th is 
quantile is using the bootstrap on the aggregated data, following I Arcones and Ginel (|l992l ). 
Alternatively, we may approxi i nate t he null distribution by fitting Pearson curves to its first 
four moments (jjohnson et al.l . Il994l . Section 18.8). Taking advantage of the degeneracy of 
the U-statistic, we obtain (see Appendix IB. 2p 



([ 



E MMD 



212 



([ 



E MMD 



213 



m{m — 1) 
8(m-2) 
m?{m — 1)2 



-E,,,, [/i2(z,z')] and (11) 



E^,^/ [h{z, z')Ey' [K^, z")h{z', z"))] + 0{m-^). (12) 
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Figure 2: Left: Empirical distribution of the MMD under "Kq, with p and q both Gaussians 
with unit standard deviation, using 50 samples from each. Right: Empirical 
distribution of the MMD under "Ki, with p a Laplace distribution with unit 
standard deviation, and q a Laplace distribution with standard deviation 3\/2, 
using 100 samples from each. In both cases, the histograms were obtained by 
computing 2000 independent instances of the MMD. 



The fourth moment E ^MMD^] j is not computed, since it is both very small, 0{m ^), 
and expensive to calculate, O(m^). Instead, we replace the kurtosij^ with a lower bound 



due to lWilkinsI jldM ). kurt (MMD^) > (skew (MMD^))^ + 1. 



Note that MMD^ may be negative, since it is an unbiased estimator of (MMD[9", p, g])^. 
However, the only terms missing to ensure nonnegativity are the terms h{zi,Zi), which were 
removed to remove spurious correlations between observations. Consequently we have the 
bound 



MMDf 



^ m 

1 + —, —^k{xi,Xi) + k{yi, 

mini — 1) ^-^ 
1=1 



Vi) - 2k{xi,yi) > 0. 



(13) 



6. A Linear Time Statistic and Test 

While the above tests are already more efficient than the 0(m^ log m) and 0{m^) tests 
described earlier, it is still desirable to obtain 0{m) tests which do not sacrifice too much 
statistical power. Moreover, we would like to obtain tests which have 0(1) storage require- 
ments for computing the test statistic in order to apply it to data streams. We now describe 
how to achieve this by computing the test statistic based on a subsampling of the terms in 
the sum. The empirical estimate in this case is obtained by drawing pairs from X and Y 
respectively without replacement. 



10. The kurtosis is defined in terms of the fourth and second moments as kurt (MMD^) — 



i;([MMDj]^)]' 
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Lemma 20 Recall m2 '■= [m/2\. The estimator 

MMDf[3-,X,y] ■.= —y^h{{x2i^i,y2i^i),{x2i,y2i)) 
ni2 ^-^ 

can he computed in linear time. Moreover, it is an unbiased estimate of MMD'^I'J, p, q]. 

While it is expected (as we will see explicitly later) that MMD^ has higher variance than 
MMD^, it is computationally much more appealing. In particular, the statistic can be used 
in stream computations with need for only 0(1) memory, whereas MMD^ requires 0{m) 
storage and 0{m?) time to compute the kernel h on all interacting pairs. 

bmce MMDf is just the average over a set of random variables, Hoeffding's bound 
and the central limit theorem readily allow us to provide both uniform conver gence and 



asymp totic statements for it with little effort. The first follows directly from iHoeffding 



tnp 

(|l96i . Theorem 2). 



Theorem 21 Assume < k{xi,Xj) < K. Then 

Pr{MMD^(J,X,y) - MMD2(3-,p,g) > t] < exp 



-t^m2 



8i^2 

where m2 ■= [m/2\ (the same bound applies for deviations of —t and below). 

Note that the bound of Theorem [T7] is identical to that of Theorem [2T1 which shows the 
former is rather loose. Next we invoke the central limit theorem. 

Corollary 22 Assume < E (/i^) < oo. Then MMD^ converges in distribution to a 
Gaussian according to 



where af = 2 



m5 (MMDf - MMD^ ['J, p,q]) ^ J< {0,af) , 
^z,z'h'^{z, z') — [Ez^z'h{z, z')]'^ , uniformly at rate l/^/m. 



The factor of 2 arises since we are averaging over only [m/2\ observations. Note the 
difference in the variance between Theorem [19] and Corollary [22l namely in the former 
case we are interested in the average conditional variance EzVarz'[h{z, z')\z], whereas in the 
latter case we compute the full variance Yaiz,z'[hiz, z')]. 

We end by noting another potential approach to reducing the com putational cost of the 



MMD, by computing a low rank approximation to the Grain matrix (jFine and Scheinberg 



200ll : IWilliams and Seegerl . I2OO1I : ISmola and Scholkopi I2OO0I) . An incremental computation 



of the MMD based on such a low rank approximation would require 0{md) storage and 
0{md) computation (where d is the rank of the approximate Gram matrix which is used 
to factorize both matrices) rather than 0{m) storage and 0{m?) operations. That said, it 
remains to be determined what effect this approximation would have on the distribution of 
the test statistic under "Kq, and hence on the test threshold. 
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7. Similarity Measures Related to MMD 

Our main point is to propose a new kernel statistic to test whether two distributions are 
the same. However, it is reassuring to observe Hnks to other measures of similarity between 
distributions. 



7.1 Link with L2 Distance between Parzen Window Estimates 

In this section, we demo nstrate the connectioii between our test statistic and the Parzen 
window-based statistic of I Anderson et al.l (119941 ). We show that a two-sample test based on 
Parze n windows converges more slowly than an RKHS-based test, also following [Anderson et al 
(jl994l ). Before proceeding, we motiv ate this discussio n with a short overview of the Parzen 
window estimate and its properties ( Silverman . 19861 ). We assume a distribution p on M'^, 
which has an associated density function also written p to minimise notation. The Parzen 
window estimate of this density from an i.i.d. sample X of size m is 



p{x) 



— y k{xi — x) where k satisfies / k (x) dx = 1 and k (x) > 0. 
mf-; Jx 



We may rescale k according to j^f^ \T^)' Consistency of the Parzen window estimate 

lim /i^ = and lim mh'^ = 00. (14) 



requires 



We n ow show that the L2 distance between Parzen windows density estimates (jAnderson et al. 



19941 ) is a special case of the biased MMD in equation ([6]). Denote by Dr{p,q) := \\ p — q 
the L r distance. For r = 1 the distance Dr{p,q) is known as the Levy distance (jFeller 
197lh. and for r = 2 we e ncounter distance measures derived from the Renyi entropy 



(jGokcav and Principd . |2002| ). 

Assume that p and q are given as kernel density estimates with kernel k(x — x' 
is, p{x) = Yli ^{^i ~ ^) s-^d q{y) is defined by analogy. In this case 



/ — v K{xi - ^) - - i^ivi - z 

''Li i 

^ m 1 " 2 

^ E ^(^^ - + ^Y1 ^^yi - y^) - — E 



that 

(15) 
(16) 



wher e k{x — y) = J k{x — z)K{y — z)dz. By its definition k[x — y) is a Mercer kernel (jMercerl . 
19091 ^. as it can be viewed as inner product between k{x — z) and K{y — z) on the domain 
X. 

A disadvantage of the Parzen window interpretation is that when the Parzen window 
estimates are consistent (which requires the kernel size to decrease with increasing sample 
size), t he resulting two - sampl e test converges more slowly than using fixed kernels. Accord' 



ing to Anderson et al. ( 19941 . p. 43), the Type II error of the two-sample test converges 



as m 

-V2/j-'^/2. Thus, given the schedule for the Parzen window size decrease in ()14|] . the 
convergence rate will lie in the open interval (0, 1/2): the upper limit is approached as 
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decreases more slowly, and the lower limit corresponds to decreasing near the upper 
bound of 1/m. In other words, by avoiding density estimation, we obtain a better con- 
vergence rate (namely m~^/^) than using a Parzen window estimate with any permissible 
bandwidth decrease schedule. In addition, the Parzen window interpretation cannot ex- 
plain the excellent performance of MMD based tests in experimental settings where the 
dimensionality greatly exceeds the sample size (for instance the Gaussian toy example in 
Figure HB, for which performance actually improves when the dimensionality increases; and 
the microarray datasets in Tabled]). Finally, our tests are able to employ universal kernels 
that cannot be written as inn er products betw een Parzen win dows, normalized or ot herwise: 
several examples are given bv lSteinwartI (|200ll . Section 3) and lMicchelli et al. I (|2006l . Section 
3). We may further ge neralize to kernels on structured objects such as strings and graphs 
(jScholkopf et al.l . 12004 ) : see also our experiments in Section [HI 



7.2 Set Kernels and Kernels Between Probability Measures 

Gartner et all tooi ) propose kernels to deal with sets of observations. These are then used 
in the context of Multi-Instance Classification (MIC). The problem MIC attempts to solve 
is to find estimators which are able to infer from the fact that some elements in the set 
satisfy a certain property, then the set of observations has this property, too. For instance, 
a dish of mushrooms is poisonous if it contains poisonous mushrooms. Likewise a keyring 
will open a door if it contains a suitable key. One is only given the ensemble, however, 
rather than information abo ut which instance of the set satisfies the property. 

The solution proposed by Gartner et al. ( 20021 ) is to map the ensembles Xi := {xn , . . . , a^j^. 
where i is the ensemble index and rrii the number of elements in the ith ensemble, jointly 
into feature space via 



(17) 



and use the latter as the basis for a kernel method. This simple approach affords rather 
good performance. With the benefit of hindsight, it is now understandable why the kernel 



k{Xi 



1 



E 

u,v 



jvj 



(18) 



produces useful results: it is simp ly the ker nel between the empi r ical m eans in feature space 
{n{Xi),fi{Xj)) (|Hein et al.l . bool Eq. 4). Ijebara and Kondoil (|2003l l later extended this 
setting by smoothing the empirical densities before computing inner products. 

Note, however, that property testing for distributions is probably not optimal when 
using the mean ^[p] (or IJ,[X] respectively): we are only interested in determining whether 
some instances in the domain have the desired property, rather than making a statement 
regarding t he distribution of tho se instances. Taking this into account leads to an improved 
algorithm ( Andrews et al. . 20031 ). 



7.3 Kernel Measures of Independence 

We next demonstrate the application of MMD in determining whether two random variables 
X and y are independent. In other words, assume that pairs of random variables {xi,yi) 
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are jointly drawn from some distribution p := Pr^^y. We wish to determine whether this 
distribution factorizes, i.e. whether q := Pr^jPr^ is the same as p . One apphc ation of 
such an independence measure is in independent component analysis (IComon . where 



the goal is to find a linear mapping of the observations Xi to obt ain mutually independen t 
outputs. Kernel meth ods were employed to solve this problem by Bach and Jordan! ( 20021 ): 



Gretton et al 



(|2005al lbh. In the following we re-derive one of the above kernel independence 
measures using mean operators instead. 
We begin by defining 



^[Pr] 

xy 

and ;u[Pr x Pr] 



Ex,y [vi{x,y),-)] 



B^By [vi{x,y),-)] 



Here we assumed that V is an RKHS over X x ^ with kernel v{{x,y),{x' ,y')). If x and 
y are dependent, the equality /xpr^^y] = /u[Pr^ x Pr^] will not hold. Hence we may use 
A := ||/i[Pri;y] — /i[Pra; x Pry]|| as a measure of dependence. 

Now assume that v{{x,y), {x' ,y')) = k{x,x')l{y,y'), i.e. that the RKHS "V is a direct 
product !K S of the RKHSs on X and y . In this case it is easy to see that 
,2 iit:, rT_/,_ M n.r.. m nr.. mii2 



^xy [Hx, ■)l{y, •)] - E, [k{x, •)] Ej, [l{y, 



B^yB^,y, [k{x,x')liy,y')] - 2E,:E„E 



y^x'y' 



[k{x,x')l{y,y')] 
+E^EyE^/Ey/ [k{x,x')liy,y')] 

The latter, however, is exactly what lOretton et al. ( 2005al ) show to be the Hilbert-Schmidt 
norm of the cross-covariance operator between RKHSs: this is zero if and only if x and y 
are independent, for universal kernels. We have the following theorem: 

Theorem 23 Denote by Cxy the covariance operator between random variables x and y, 
drawn jointly from Vi^y, where the functions on X and y are the reproducing kernel Hilbert 
spaces 3' and S respectively. Then the Hilbert-Schmidt norm ||C^j/||jjs ^Quals A. 

Empirical estimates of this quantity are as follows: 

Theorem 24 Denote by K and L the kernel matrices on X and Y respectively, and by 
H = I — 1/m the projection matrix onto the subspace orthogonal to the vector with all 
entries set to 1. Then m~^ivHKHL is an estimate of IS? with bias 0(m~^). With high 
probability the deviation from A^ is 0{m^^). 

Gretton et aD (j2005al ') provide explicit constants. In certain circumstances, including in the 



case of RKHSs with Gaussian kernels, the empirical A^ may also be interpreted in terms 
of a smoothed difference betwee n the joint empiric a l characteri s tic fu nction (EOF) and the 
product of the marginal ECFs (iFeuervergeii . I1993I : iKankainenl . I199R| ). This interpretation 
does not hold in all cases, however, e.g. for kernels on strings, graphs, and other structured 
spaces. An illustration of the witness function / G 3" from Definition [2] is provided in Figure 
[3l This is a smooth function which has large magnitude where the joint density is most 
different from the product of the marginals. 

We remark that a hypothesis test based on the above kernel statistic is more complicated 
than for the two-sample problem, since the product of the marginal distributions is in effect 
sir aulated by per r nuting the variables of the original sample. Further details are provided 
bv lGretton et al.l ^2008 ). 
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Figure 3: Illustration of the function maximizing the mean discrepancy when MMD is used 
as a measure of independence. A sample from dependent random variables x and 
y is shown in black, and the associated function / that witnesses the MMD is 
plotted as a contour. The latter was computed empirically on the basis of 200 
samples, using a Gaussian kernel with a = 0.2. 
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7.4 Kernel Statistics Using a Distribution over Witness Functions 



Shawe- Taylor and Dolial (120071 ) define a distance between distributions as follows: let "K 
be a set of functions on X and r be a probability distribution over 3". Then the distance 
between two distributions p and q is given by 



D{p, q) 



E.^,[/(x)]| 



(19) 



That is, we compute the average distance between p and q with respect to a distribution of 
test functions. 



Lemma 25 Let "K he a reproducing kernel Hilbert space, f £ "K, and assume r{f) = 
r(||/||j^) with finite Ej^j,[||/|| j^] . Then D(j),q) = C \\n\p] — fJ'[q]\\^ for some constant C 
which depends only on % and r. 

Proof By definition Ep[/(a;)] = {fJ-lp],!)^- Using linearity of the inner product, Equa- 
tion (fT9]) equals 



mp]-f^[q],f):^\drif) 
I r 1 r 111 f / l^iP] -A'M 



-,/ 



drif), 



where the integral is independent of p, q. To see this, note that for any p, q, — is a 

unit vector which can turned into, say, the first canonical basis vector by a rotation which 
leaves the integral invariant, bearing in mind that r is rotation invariant. ■ 



7.5 Outlier Detection 



An application related to the two sample problem is that of outlier detection: this is the 
question of whether a novel point is generated from the same distribution as a particular 
i.i.d. sample. In a way, this is a special case of a two sample test, where the second sample 
contains only one observation. Several methods essentially rely on the distance between a 
novel point to th e sample mean in feature space to detect outliers. 

For instance. iDavv et al.l (120021') use a rela.ted me thod to deal with nonstationary time 
series. Likewise IShawe-Tavlor and Cristianinil (j2004l . p. 117) discuss how to detect novel 
observations by using the following reasoning: the probability of being an outlier is bounded 
both as a function of the spread of the points in feature space and the uncertainty in 
the empirical feature space mean (as bounded using symmetrisation and McDiarmid's tail 
bound). 

Instead of using the sample mean and variance, iTax and Duh] (| 19991 ) estimate the center 
and radius of a minimal enclosing sphere for the data, the advantage being that such b ound s 



can potentially lead to more reliable tests for single observations. IScholkopf et al.l (|200ll ) 
show that the minimal enclosing sphere problem is equivalent to novelty detection by means 
of finding a hyperplane separating the data from the origin, at least in the case of radial 
basis function kernels. 
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8. Experiments 

We conducted distribution comparisons using our MMD-based tests on datasets from three 
real-world domains: database applications, bioinformatics, and neurobiology. We investi- 
gated both uniform convergence approaches (MMD;, with the Corollarv 1161 threshold, and 
MMD^ H with the Corollary [18] threshold); the asymptotic approaches with bootstrap 
(MMD^ B) and moment matching to Pearson curves (MMD^ M), both described in Sec- 
tion \E\ and the asymptotic approach using the linear time statistic (MMD^) from Section 
IHl We also compared against several alternatives from the literature (where applicable): 
the multivariate t-test, the Friedman-Rafsky Kolmogorov-Smirnov generalisation (Smir), 
the Priedman-Rafsky Wald-Wolfowitz generalisation (Wolf), the Biau-Gyorfi test (Biau), 
and the Hall- Taj vidi test (Hall). See Section 13.31 for details regarding these tests. Note 
that we do not apply the Biau-Gyorfi test to high-dimensional problems (since the required 
space partitioning is no longer possible), and that MMD is the only method applicable to 
structured data such as graphs. 

An important issue in the practical application of the MMD-based tests is the selection 
of the kernel parameters. We illustrate this with a Gaussian RBF kernel, where we must 
choose the kernel width a (we use this kernel for univariate and multivariate data, but not 
for graphs). The empirical MMD is zero both for kernel size o" = (where the aggregate 
Gram matrix over X and y is a unit matrix) , and also approaches zero as a — > oo (where the 
aggregate Gram matrix becomes uniformly constant). We set a to be the median distance 
between points in the aggregate sample, as a com promise between these t wo extr ernes: th is 
remains a heuristic, similar to those described in iTakeuchi etaD (|2006l l: IScholkopj (|l997l ). 



and the optimum choice of kernel size is an ongoing area of research. 



8.1 Toy Example: Two Gaussians 

In our first experiment, we investigated the scaling performance of the various tests as a 
function of the dimensionality d of the space X C M.'^, when both p and q were Gaussian. 
We considered values of d up to 2500: the performance of the MMD-based tests cannot 
therefore be explained in the context of density estimation (as in Section l7.ip . since the 
associated density estimates are necessarily meaningless here. The levels for all tests were 
set at a = 0.05, m = 250 samples were used, and results were averaged over 100 repetitions. 
In the first case, the distributions had different means and unit variance. The percentage of 
times the null hypothesis was correctly rejected over a set of Euclidean distances between 
the distribution means (20 values logarithmically spaced from 0.05 to 50), was computed 
as a function of the dimensionality of the normal distributions. In case of the t-test, a 
ridge was added to the covariance estimate, to avoid singularity (the ratio of largest to 
smallest eigenvalue was ensured to be at most 2). In the second case, samples were drawn 
from distributions N(0, 1) and N(0, cr^I) with different variance. The percentage of null 
rejections was averaged over 20 a values logarithmically spaced from lO'^ ''^ to 10. The 
t-test was not compared in this case, since its output would have been irrelevant. Results 
are plotted in Figure HI 

In the case of Gaussians with differing means, we observe the t-test performs best in low 
dimensions, however its performance is severely weakened when the number of samples ex- 
ceeds the number of dimensions. The performance of MMD'^ M is comparable to the t-test 
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Figure 4: Type II performance of the various tests when separating two Gaussians, with 
test level a = 0.05. A Gaussians have same variance and different means. B 
Gaussians have same mean and different variances. 



for low sample sizes, and outperforms all other methods for larger sample sizes. The worst 
performance is obtained for MMD"^ H, though MMDj, also does relatively poorly: this 
is unsurprising given that these tests derive from distribution-free large deviation bounds, 
whereas the sample size is relatively small. Remarkably, MMDf performs quite well com- 
pared with classical tests in high dimensions. 

In the case of Gaussians of differing variance, the Hall test performs best, followed closely 
by MMD^. FR Wolf and (to a much greater extent) FR Smirnov both have difficulties in 
high dimensions, failing completely once the dimensionality becomes too great. The linear 
test MMDf again performs surprisingly well, almost matching the MMD"^ performance in 
the highest dimensionality. Both MMD^H and MMDf, perform poorly, the former failing 
completely: this is one of several illustrations we will encounter of the much greater tightness 
of the Corollarv 1161 threshold over that in Corollary 1181 



8.2 Data Integration 

In our next application of MMD, we performed distribution testing for data integration: the 
objective being to aggregate two datasets into a single sample, with the understanding that 
both original samples were generated from the same distribution. Clearly, it is important to 
check this last condition before proceeding, or an analysis could detect patterns in the new 
dataset that are caused by combining the two different source distributions, and not by real- 
world phenomena. We chose several real-world settings to perform this task: we compared 
microarray data from normal and tumor tissues (Health status), microarray data from 
different subtypes of cancer (Subtype), and local field potential (LFP) electrode recordings 
from the Macaque primary visual co rtex (VI) with and without spike events (Neural Data I 
and II, as described in more detail by Rasch et al. . 20081 ) . In all cases, the two data sets have 



different statistical properties, but the detection of these differences is made difficult by the 
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high data dimensionahty (indeed, for the microarray data, density estimation is impossible 
given the sample size and data dimensionality, and no successful test can rely on accurate 
density estimates as an intermediate step). 

We applied our tests to these datasets in the following fashion. Given two datasets A 
and B, we either chose one sample from A and the other from B (attributes = different)] or 
both samples from either A or B (attributes = same). We then repeated this process up to 
1200 times. Results are reported in Table [H Our asymptotic tests perform better than all 
competitors besides Wolf: in the latter case, we have greater Type II error for one neural 
dataset, lower Type II error on the Health Status data (which has very high dimension 
and low sample size), and identical (error-free) performance on the remaining examples. 
We note that the Type I error of the bootstrap test on the Subtype dataset is far from its 
design value of 0.05, indicating that the Pearson curves provide a better threshold estimate 
for these low sample sizes. For the remaining datasets, the Type I errors of the Pearson 
and Bootstrap approximations are close. Thus, for larger datasets, the bootstrap is to be 
preferred, since it costs 0{m?), compared with a cost of O(m^) for Pearson (due to the cost 
of computing ()12p ). Finally, the uniform convergence-based tests are too conservative, with 
MMDf, finding differences in distribution only for the data with largest sample size, and 
MMD^ H never finding differences. 



Dataset 


Attr. 


MMDf, 


MMD^ H 


MMD^ B 


MMD^ M 


t-test 


Wolf 


Smir 


Hall 


Neural Data 1 


Same 


100.0 


100.0 


96.5 


96.5 


100.0 


97.0 


95.0 


96.0 




Different 


38.0 


100.0 


0.0 


0.0 


42.0 


0.0 


10.0 


49.0 


Neural Data 11 


Same 


100.0 


100.0 


94.6 


95.2 


100.0 


95.0 


94.5 


96.0 




Different 


99.7 


100.0 


3.3 


3.4 


100.0 


0.8 


31.8 


5.9 


Health status 


Same 


100.0 


100.0 


95.5 


94.4 


100.0 


94.7 


96.1 


95.6 




Different 


100.0 


100.0 


1.0 


0.8 


100.0 


2.8 


44.0 


35.7 


Subtype 


Same 


100.0 


100.0 


99.1 


96.4 


100.0 


94.6 


97.3 


96.5 




Different 


100.0 


100.0 


0.0 


0.0 


100.0 


0.0 


28.4 


0.2 



Table 1: Distribution testing for data integration on multivariate data. Numbers indicate 
the percentage of repetitions for which the null hypothesis (p=q) was accepted, 
given a = 0.05. Sample size (dimension; repetitions of experiment): Neural I 4000 
(63; 100) ; Neural II 1000 (100; 1200); Health Status 25 (12,600; 1000); Subtype 
25 (2,118; 1000). 



8.3 Computational Cost 

We next investigate the tradeoff between computational cost and performance of the various 
tests, with particular attention to how the quadratic time MMD tests from Sections [Hand [5] 
compare with the linear time MMD-based asymptotic test from Section [6l We consider two 
1-D datasets (CNUM and FOREST) and two higher-dimensional datasets (FORESTIOD 
and NEUROII). Results are plotted in Figure [5l If cost is not a factor, then the MMD^ B 
shows best overall performance as a function of sample size, with a Type II error dropping 
to zero as fast or faster than competing approaches in three of four cases, and narrowly 
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trailing FR Wolfm the fourth (FORESTIOD). That said, for datasets CNUM, FOREST, 
and FORESTIOD, the hnear time MMD achieves results comparable to MMD^ B at a far 
smaller computational cost, albeit by looking at a great deal more data. In the CNUM 
case, however, the linear test is not able to achieve zero error even for the largest data set 
size. For the NEUROII data, attaining zero Type II error has about the same cost for both 
approaches. The difference in cost of MMD^ B and MMD^ is due to the bootstrapping 
required for the former, which produces a constant offset in cost between the two (here 150 
resamplings were used). 

The t-test also performs well in three of the four problems, and in fact represents the best 
cost-performance tradeoff in these three datasets (i.e. while it requires much more data than 
MMD^ B for a given level of performance, it costs far less to compute). The t-test assumes 
that only the difference in means is important in distinguishing the distributions, and it 
requires an accurate estimate of the within-sample covariance; the test fails completely 
on the NEUROII data. We emphasise that the Kolmogorov-Smirnov results in 1-D were 
obtained using the classical statistic, and not the Friedman-Rafsky statistic, hence the low 
computational cost. The cost of both Friedman-Rafsky statistics is therefore given by the 
FR Wolf cost in this case. The latter scales similarly with sample size to the quadratic 
time MMD tests, confirming Friedman and Rafsky's observation that obtaining the pairwise 
distances between sample points is the dominant cost of their tests. We also remark on the 
unusual behaviour of the Type II error of the FR Wolf test in the FOREST dataset, which 
worsens for increasing sample size. 

We conclude that the approach to be recommended when testing homogeneity will 
depend on the data available: for small amounts of data, the best results are obtained using 
every observation to maximum effect, and employing the quadratic time MMD^ B test. 
When large volumes of data are available, a better option is to look at each point only once, 
which can yield greater accuracy for a given computational cost. It may also be worth doing 
a t-test first in this case, and only running more sophisticated non-parametric tests if the 
t-test accepts the null hypothesis, to verify the distributions are identical in more than just 
mean. 

8.4 Attribute Matching 

Our final series of experiments addresses automatic attribute matching. Given two databases, 
we want to detect corresponding attributes in the schemas of these databases, based on their 
data-content (as a simple example, two databases might have respective fields Wage and 
Salary, which are assumed to be observed via a subsampling of a particular population, 
and we wish to automatically determine that both Wage and Salary denote to the same 
underlying attribute) . We use a two-sample test on pairs of attributes from two databases 
to find corresponding pairs0 This procedure is also called table matching for tables from 
different databases. We performed attribute matching as follows: first, the dataset D was 
split into two halves A and B. Each of the n attributes in A (and B, resp.) was then rep- 
resented by its instances in A (resp. B). We then tested all pairs of attributes from A and 



11. Note that corresponding attributes may have different distributions in real-world databases. Hence, 
schema matching cannot solely rely on distribution testing. Advanced approaches to schema matching 
using MMD as one key statistical test are a topic of current research. 
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Figure 5: Linear vs quadratic MMD. First column is performance, second is runtime. The 
dashed grey horizontal line indicates zero Type II error (required due log y-axis) 
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from B against each other, to find the optimal assignment of attributes Ai, . . . , An from A 
to attributes Bi, . . . , from B. We assumed that A and B contain the same number of 
attributes. 

As a naive approach, one could assume that any possible pair of attributes might cor- 
respond, and thus that every attribute of A needs to be tested against all the attributes of 
B to find the optimal match. We report results for this naive approach, aggregated over 
all pairs of possible attribute matches, in Table [2j We used three datasets: the census 
income dataset from the U CI KDD archive (CNUM), th e protein homology dataset from 
the 2004 KDD C up fBIQ) dCaruana and Joachimj . a, and the forest dataset from the 



UCI ML archive ( Blake and Merz . 19981 ). For the final dataset, we performed univariate 



matching of attributes (FOREST) and multivariate matching of tables (FORESTIOD) from 
two different databases, where each table represents one type of forest. Both our asymptotic 
MMD^-based tests perform as well as or better than the alternatives, notably for CNUM, 
where the advantage of MMD^ is large. Unlike in Table [H the next best alternatives are not 
consistently the same across all data: e.g. in BIO they are Wolf or Hall, whereas in FOR- 
EST they are Smir, Biau, or the t-test. Thus, MMD^ appears to perform more consistently 
across the multiple datasets. The Friedman-Rafsky tests do not always return a Type I 
error close to the design parameter: for instance, Wolf has a Type I error of 9.7% on the 
BIO dataset (on these data, MMD^ has the joint best Type II error without compromising 
the designed Type I performance). Finally, MMDf, performs much better than in Table [H 
although surprisingly it fails to reliably detect differences in FORESTIOD. The results of 
MMD^ H are also improved, although it remains among the worst performing methods. 

A more principled approach to attribute matching is also possible. Assume that (j){A) = 
{(pi{Ai),(p2iA2), ■.■,(j)n{An))- in other words, the kernel decomposes into kernels on the 
individual attributes of A (and also decomposes this way on the attributes of B). In this 
case, MMD'^ can be written Y17=i Wl^ii^i) ~ where we sum over the MMD terms 

on each of the attributes. Our goal of optimally assigning attributes from B to attributes 
of A via MMD is equivalent to finding the optimal permutation vr of attributes of B that 
minimizes X]r=i ~ /"«(-^7r(j))lP- If '^^ define Cij = \\iJ,i{Ai) — fii{Bj)\\'^, then this is 

the same as minimizing the sum over Ci^T^f^iy T his is the lin ear assignment problem, which 
costs O(n^) time using the Hungarian method ( Kuhn , 19551 ). 



While this may appear to be a crude heuristic, it nonetheless defines a semi-metric on 
the sample spaces X and Y and the corresponding distributions p and q. This follows 
from the fact that matching distances are proper metrics if the matching cost functions are 
metrics. We formalize this as follows: 

Theorem 26 Let p,q be distributions on and denote by Pi,qi the marginal distribu- 
tions on the i-th variable. Moreover, denote by H the symmetric group on {1, . . . ,d}. The 
following distance, obtained by optimal coordinate matching, is a semi-metric. 



A[9^,p,q] :=min^MMD[:J,K, 



Tren ■ 

i=l 



Proof Clearly A[3", p, q] is nonnegative, since all of its summands are. Next we show the 

^P,g7 and TTp y. 



triangle inequality. Denote by r a third distribution on M.'^ and let vTp iTg^r and vTp^j. be the 
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distance minimizing permutations between p, q and r respectively. It then follows that 

d d 



A[J,p, q] + A[:r, q,r]=Y, MMD[J,pi, q^^^^^.^] + MMD[J, r,,,,(i)] 

i=\ i=l 
d 



i=l 

Here the first inequality follows from the triangle inequality on MMD, that is 
The second inequality is a result of minimization over tt. 



Datasct 


Attr. 


MMDfc 


MMD^ H 


MMD^ B 


MMD^ M 


t-test 


Wolf 


Smir 


Hall 


Biau 


BIO 


Same 


100.0 


100.0 


93.8 


94.8 


95.2 


90.3 


95.8 


95.3 


99.3 




Different 


20.0 


52.6 


17.2 


17.6 


36.2 


17.2 


18.6 


17.9 


42.1 


FOREST 


Same 


100.0 


100.0 


96.4 


96.0 


97.4 


94.6 


99.8 


95.5 


100.0 




Different 


3.9 


II. 


0.0 


0.0 


0.2 


3.8 


0.0 


50.1 


0.0 


CNUM 


Same 


100.0 


100.0 


94.5 


93.8 


94.0 


98.4 


97.5 


91.2 


98.5 




Different 


14.9 


52.7 


2.7 


2.5 


19.17 


22.5 


II.6 


79.1 


50.5 


FORESTIOD 


Same 


100.0 


100.0 


94.0 


94.0 


100.0 


93.5 


96.5 


97.0 


100.0 




Different 


86.6 


100.0 


0.0 


0.0 


0.0 


0.0 


I.O 


72.0 


100.0 



Table 2: Naive attribute matching on univariate (BIO, FOREST, CNUM) and multivariate 
data (FORESTIOD). Numbers indicate the percentage of accepted null hypothesis 
(p=q) pooled over attributes, a = 0.05. Sample size (dimension; attributes; 
repetitions of experiment): BIO 377 (1; 6; 100); FOREST 538 (1; 10; 100); CNUM 
386 (1; 13; 100); FORESTIOD 1000 (10; 2; 100). 



We tested this 'Hungarian approach' to attribute matching via MMD^ B on three 
univariate datasets (BIO, CNUM, FOREST) and for table matching on a fourth (FOR- 
ESTIOD). To study MMD^ B on structured data, we obtained two datasets of protein graphs 
(PRO TEINS and ENZYMES) and used the graph kernel for proteins from Borgwardt et al.] 
(|2005l l :br table matching via the Hungarian method (the other tests were not applicable 
to this graph data). The challenge here is to match tables representing one functional class 
of proteins (or enzymes) from dataset A to the corresponding tables (functional classes) in 
B. Results are shown in Table [31 Besides on the BIO and CNUM datasets, MMD^ B made 



no errors. 



9. Conclusion 

We have established three simple multivariate tests for comparing two distributions p and q, 
based on samples of size m and n from these respective distributions. Our test statistic is the 
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Dataset 


Data type 


No. attributes 


Sample size 


Repetitions 


% correct matches 


BIO 


univariate 


6 


377 


100 


90.0 


CNUM 


univariate 


13 


386 


100 


99.8 


FOREST 


univariate 


10 


538 


100 


100.0 


FORESTIOD 


multivariate 


2 


1000 


100 


100.0 


ENZYME 


structured 


6 


50 


50 


100.0 


PROTEINS 


structured 


2 


200 


50 


100.0 



Table 3: Hungarian Method for attribute matching via MMD^ B on univariate (BIO, 
CNUM, FOREST), multivariate (FORESTIOD), and structured data (EN- 
ZYMES, PROTEINS) (a = 0.05; '% correct matches' is the percentage of the 
correct attribute matches detected over all repetitions). 



maximum mean discrepancy (MMD), defined as the maximum deviation in the expectation 
of a function evaluated on each of the random variables, taken over a sufficiently rich function 
class: in our case, a universal reproducing kernel Hilbert space (RKHS). Equivalently, the 
statistic can be written as the norm of the difference between distribution feature means 
in the RKHS. We do not require density estimates as an intermediate step. Two of our 
tests provide Type I error bounds that are exact and distribution-free for finite sample 
sizes. We also give a third test based on quantiles of the asymptotic distribution of the 
associated test statistic. All three tests can be computed in 0((m + n)^) time, however 
when sufficient data are available, a linear time statistic can be used, which employs more 
data to get better results at smaller computational cost. In addition, a number of metrics 
on distributions (Kolmogorov-Smirnov, Earth Mover's, L2 distance between Parzen window 
density estimates), as well as certain kernel similarity measures on distributions, are included 
within our framework. 

While our result establishes that statistical tests based on the MMD are consistent for 
universal kernels on c ompact domains , we dr aw attention to the recent introduction of char- 



acteristic kernels bv iFukumizu et al 



( 20081 ). these being kernels for which the mean map 
is injective. Fukumizu et al. establish that Gaussian and Lapl ace kernels are charac teristic 
on M , and thus the MMD is a consistent test for this domain. ISriperumbudur et al. 1 (I2OO8) 
further explore the properties of characteristic kernels, providing a simple condition to de- 
termine whether convolution kernels are characteristic, and describing characteristic kernels 
which are not universal on compact domains. We also note (following Section [7. 2p that the 
MMD for RKHSs is associated with a particular kernel between probability distributions. 



Hein et al.l (|2004l ) describe several further such kernels, which induce corresponding dis- 



tances between feature space distribution mappings: these may in turn lead to new and 
powerful two-sample tests. 

Two recent studies have shown that additional divergence measures between distribu- 
tions c an be obtained e mpiri cally through optimiza tion in a repr o ducing kernel Hilbert 



space. 



Harchaoui et al. ( 20081 ) build on the work of Gretton et al. ( 2007a ). considering a 



homogeneity statistic arising from the ker nel Fisher discriminant, rather than the difference 
of RKHS means; and Nguyen et al. ( 20081 ) obtain a KL divergence estimate by approximat- 
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ing the ratio of densities (or its log) with a function in an RKHS. By design, both these 
kernel-based statistics prioritise different features of p and q when measuring the divergence 
between them, and the resulting effects on distinguishability of distributions are therefore 
of interest. 

Finally, we have seen in Section [2] that several classical metrics on probability distri- 
butions can be written as maximum mean discrepancies with function classes that are not 
Hilbert spaces, but rather Banach, metric, or semi-metric spaces. It would be of particular 
interest to establish under what conditions one could write these discrep ancies in terms of 
norms of differences of mean elements. In particular, Per and Lee (2003) consider Banach 
spaces endowed with a semi-inner product, for which a General Riesz Representation exists 
for elements in the dual. 



Appendix A. Large Deviation Bounds for Tests with Finite Sample 
Guarantees 

A.l Preliminary Definitions and Theorems 

We need the following theorem, due to McDiarmidI ( 19891 ). 



Theorem 27 (McDiarmid's inequality) Let f : X^' 

all i £ {1, ... , m}, there exist Cj < cxd for which 



be a function such that for 



sup . . . Xm) - f{xi, . . . Xi-i,X,Xi+i, . . .,Xni)\ < Ci- 



Then for all probability measures p and every e > 0, 

Px^ (/(x) - Exm(/(x)) > t) < exp 



2e" 



We also define the Rademacher average of the function class 3^ with respect to the m-sample 
X. 

Definition 28 (Rademacher average of 9" on X) Let 9" be the unit ball in a universal 
RKHS on the compact domain X, with kernel bounded according to < k{x,y) < K. Let X 
be an i.i.d. sample of size m drawn according to a probability measure p on X, and let ai be 
i.i.d and take values in {—1, 1} with equal probability. We define the Rademacher average 



RU3',X) := E.sup 



^ m 

— y'o-i/(a;i) 
m ^-^ 



i=l 



where the upper bound is due to lBartlett and Mendelsort (200m, Lemma 22). Similarly, we 
define 



i?m(3",p) := Ep^^sup 



1 

— y'cri/(a;i) 
m ^-^ 



i=l 
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A. 2 Bound when p and q May Differ 

We want to show that the absolute difference between MMD(J',p, q) and MMDf,(3", X, Y) is 
close to its expected value, independent of the distributions p and q. To this end, we prove 
three intermediate results, which we then combine. The first result we need is an upper 
bound on the absolute difference between MMD(3',p,q) and MMDb(3", X, y). We have 



\MMB{3',p,q) -MMBbi3',X,Y)\ 

m 1 " ^ 

E,(/)-E,(/)-lf;/(x.) + ^X;/(y^ 



< sup 



i=l 



n 



i=l 



(20) 



A(p,<7,x,y) 



Second, we provide an upper bound on the difference between A(p, g, X, Y) and its expec- 
tation. Changing either of Xi or yi in A(p, g, X, y) results in changes in magnitude of at 
most 2K^/'^ /m or 2X^/^/71, respectively. We can then apply McDiarmid's theorem, given 
a denominator in the exponent of 



m 



AK[- + - 

m n 



m + n 



mn 



to obtain 



Pr (A(p, g, X, Y) - Ex,y [A(p, g, X, Y)] > e) < exp 



e^mn 



2K[m + n) 



(21) 



For ou r final result, we exploit symmetrisation, following e.g. Ivan der Vaart and Wellner 
( 19961 . p. 108), to upper bound the expectation of A{p,q,X,Y). Denoting by X' an i.i.d 
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sample of size m drawn independently of X (and likewise for Y'), we have 
Ex,Y [A{p,q,X,Y)] 



Ex,y sup 

Ex,y sup 

/63^ 



< Ex,y,x',y' sup 

(a) /e3^ 



^X,Y,X',Y',a,a' SUp 



Ep(/)-;^f;/(^o-E,(/)+^x;/(^^) 

1=1 i=l 
\ i=l / i=l \ i=l / i=l 

^ E /(-^) - ^ E n-^) - - E /(y^) + ^ E f(y^ 

=1 i=l i=l i=l 



i=l 



i=l 



i=l 



< Ex,X'a sup 

< 2[Rm{3',p)+Rni3',q)]- 



+ Ey^y/^ sup 



^E^4/(?/;) -/(%■)) 



i=l 



< 2 

(d) 



(22) 



where (a) uses Jensen's inequality, (b) uses the triangle inequality, (c) substitutes Definition 
[28] (the Rademacher average), and (d) bounds the Rademacher averages, also via Definition 



Having established our preliminary results, we proceed to the proof of Theorem 1141 
Proof [Theorem [T3] Combining equations (j2ip and (j22p . gives 



Pr {^A{p, q, X,Y)-2 (K/m)^^^ + (K/n)^^^ 
Substituting equation ([20]) yields the result. 



> e 1 < exp 



2K{m + n) 



A. 3 Bound when p = q and m = n 

In this section, we derive the Theorem [TJ] result, namely the large deviation bound on the 
MMD when p = q and m = n. Note also that we consider only positive deviations of 
MMDf,(3", X, y) from MMD(3", p, g), since negative deviations are irrelevant to our hypoth- 
esis test. The proof follows the same three steps as in the previous section. The first step 
in (|20p becomes 

MMD;,(J,X,y) -MMD(g",p,g) = MMD;,( J, X, X') - 

= ^^pf;^E(/(^0-/(xD)V (23) 
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The McDiarmid bound on the difference between (j23p and its expectation is now a function 
of 2m observations in (j23p . and has a denominator in the exponent of 2m {2K^/'^ /rnj = 
SK/m. We use a different strategy in obtaining an upper bound on the expected (p3]) . 
however: this is now 



E 



X,X' 



m 



^ m 

sup-^(/(x.)-/(x^) 

m 



i=l 



-E 



m 



X,X' 



Xj) + A;(xj, Xj) k(xi,Xj) k(x^,Xj)^ 

i=i j=i 



< — \2mExk(x, x) + 2m(m - llEj. x'k(x, x') - 2rn?'E,x x'k(x, x')] 
m ' ' 



—Ex^x' (k{x,x) - k{x,x')) 
m 



< {2K/m) 



1/2 



(24) 
(25) 



We remark that both (j24p and (j25p bound the amount by which our biased estimate of the 
population MMD exceeds zero under 'Kq. Combining the three results, we find that under 

;Ko, 



px MMDb(J,X,X') 



m 



Ex,x' {k{x,x) - k{x,x')) 



> e < exp 



px ( MMDb(:T, X, X') - (2Klmfl'^ > e ) < exp 



2 

-em 
4K 

-e'^m^ 
4K . 



and 



Appendix B. Proofs for Asymptotic Tests 

We derive results needed in the asymptotic test of Section O Appendix IB. II describes the 
distribution of the empirical MMD under !Ko (both distributions identical). Appendix IB. 21 
contains derivations of the second and third moments of the empirical MMD, also under 

B.l Convergence of the Empirical MMD under "Kq 

We describe the distribution of the unbiased estimator MMD^ [[F, X, y ] under the null hy- 
pothesis. In this circumstance, we denote it by MMD^[9", X, X'], to emphasise that the 
second sample X' is drawn independently from the same distribution as X. We thus obtain 
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the U-statistic 

MMBl[3^,X,X'] = ^ y^k{x„xj) + k{x[,x'j)-k{x„x'j)-k{xj,x'i) (26) 

mlm — 1 ^-^ ■' J J 

^+3 



= -r^—r,y,Kz'^,^j). (27) 
mim — 1) ^-^ 

where = (xj,a;.). Under the null hypothesis, this U-statistic is degenerate, meaning 

E^^./i(zi,Zj) = Y^^,^k(xi,Xj) + E^/ A;(x-,x^-) - E^/ A;(xj, a;^-) - E^^. ^(xj, x-) 
= 0. 

The following theorem from lSerfling (|l98d . Section 5.5.2) then applies. 

Theorem 29 Assume MMD2[J, X, X'] is as defined in ^21^ , with Ez'h{z, z') = 0, and fur- 
thermore assume < E^,^//i2(z,z') < oo. Then MMD^ [J, X, X'] converges in distribution 
according to 

oo 

mMMI)l[3^,X,X'] {xli - l) , 

1=1 

where xfi o,re independent chi squared random variables of degree one, and 7/ are the solu- 
tions to the eigenvalue equation 

jlipl{u) = j h{u,v)ipi{v)dFT . 

While this result is adequate for our purposes (since we do not explicitly use the quantities 7/ 
in our subsequent reasoning) , it does not make clear the dependence of the null distribution 
on the kern el choice. For this rea son, we provide an alternative expression based on the 
reasoning of Anderson et al] ( 1994 . Appendix), bearing in mind the following changes: 



• we do not need to deal with the bias terms Sij seen by lAnderson et al.l (|1994| . Ap- 
pendix) that vanish for large sample sizes, since our statistic is unbiased (although 
these bias terms drop faster than the variance); 

• we require greater generality, since we deal with distributions on compact metric 
spaces, and not densities on W^; correspondingly, our kernels are not necessarily inner 
products in L2 between probability density functions (although this is a special case). 

Our first step is to express the kernel h{zi, zj) of the U-statistic in terms of an RKHS kernel 
k{xi,Xj) between feature space mappings from which the mean has been subtracted, 

k{xi,Xj) := {(l){xi) - fi[p],(l){xj) - fi[p]) 

= k{xi, Xj) - E^/c(xj, x) - Ea;/c(x, Xj) + Ea,^a,//c(x, x'). 

The centering terms cancel (the distance between the two points is unaffected by an identical 
global shift in both the points), meaning 

h{zi,Zj) = k{xi,Xj) + k{yi,yj) - k{xi,yj) - k{xj,yi). 
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Next, we write the kernel k{xi,Xj) in terms of eigenfunctions il^i{x) with respect to the 
probabiUty measure Pr^;, 

oo 

k{x,x') = '^Xitl;i{x)ipi{x'), 
1=1 

where 

/ k{x,x')ilji{x)dFT{x) = Xiipiix') 

and 

V^i(x)V'j(a;)rfPr(x) = (28) 



X 



Since 



^xk{x, v) = Ea;/c(x, v) - Ea.^a.//c(x, x') - Yjxk{x, v) + E^.^^/fc(x, x) 

= 0, 

then when Aj 7^ 0, we have that 

\iExi'4)i{x') = / 'Exik{x,x')'ilJi{x)dY'r{x) 
= 0, 

and hence 

E,Vi(rc) = 0. (29) 
We now use these results to transform the expression in ()26p . First, 



i^i ii^j 1=1 



— ^k{xi,Xj) = 

i^j 1=1 

1=1 \ \ i / i 



D 

1=1 



where yi ~ 3Nf(0, 1) are i. i.d., and t he fin al relation denotes convergence in distribution, using 
dlHD and ([29]), following [Serflingl (|l98d . Section 5.5.2). Likewise 



m ^ — ^ -J n ^ — ^ 



m ^ ■' D 

i^j 1=1 



where zi ~ !N(0, 1), and 

^ 00 
-^—-r'^(k{xi,yj) + k{xj,yi)^ 2^\iyizi. 



m(m 
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Combining these results, we get 

oo 

1=1 

oo 

= ^ A4(y, - - 2] . 
/=i 

Note that yi — zi, being the difference of two independent Gaussian variables, has a normal 
distribution with mean zero and variance 2. This is therefore a quadratic form in a Gaussian 
random variable minus an offset 2^^^ A;. 

B.2 Moments of the Empirical MMD Under JCq 

In this section, we compute the moments of the U-statistic in Section [H under the null 
hypothesis conditions 

E,,,,/i(z,z') = 0, (30) 

and, importantly, 

B;,,h{z,z') = 0. (31) 

Note that the latter implies the former. 

Varianc e /2nd mome nt: This was derived by Hoeffding ( 19481 . p. 299), and is also 
described bv ISerfling (1980, Lemma A p. 183). Applying these results, 



E ( [MMD^ 



2l2 



2 



n{n — 1) 



) - 2)(2)E, [(E,,M^, ^')f] + ^^^^E,,,, [h\z, z')] 



^i^E, [{^,h{z,z')f] + -r^E.,.' [h\z,z') 
n[n — Ij ^ n[n — 1) 



n{n — 1) 

where the first term in the penultimate line is zero due to (13ip . Note that variance and 2nd 
moment are the same under the zero mean assumption. 

3rd moment: We consider the terms that appear in the expansion of E ^[MMD^]^ 
These are all of the form 



2 ^3 



nin — 1 , 



^{habhcdhef), 



where we shorten hat = h{za,Zb), and we know Za and Zf, are always independent. Most of 
the terms vanish due to ()30p and (j3ip . The first terms that remain take the form 

3 

^{habhbch 

ca)i 



n{n — 1) 
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and there are 

of them, which gives us the expression 

3 



n{n — 1) 



= J|^_f)2 E^.^' [^(^' ^')E." {hiz, z")h{z', z"))] . (32) 

Note the scahng ~ The remaining non-zero terms, for which a = c = e and 

b = d = f , take the form 



and there are "^"^ of them, which gives 



-)'e,,,, [hHz,z')] 



(33) 



However {^jj^z^^ ~ n ^ so this term is negligible compared with (j32p . Thus, a reasonable 
approximation to the third moment is 

E ([MMD^]^) ^ J|L^E,,,, [M^,^')E." (Mz,z")M^',^"))] • 
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